Agenda

  1. Part 1: The basics: What is R? How to get help?

  2. Part 2: R language fundamentals.

  3. Part 3: An extended example using HPCC

Part 1: The basics: What is R? How to get help?

Before we start

We will be using:

The R programming language


A little bit of history

Picture by the New York Times https://nyti.ms/2GC3ruP
From left to right: John Chambers, Robert Gentleman, and Ross Ihaka.
  • R (R Core Team 2018) is an implementation of the S (for Statistics) programming language, which was created in 1976 by John Chambers while at Bell Labs.

  • R itself was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, with its first release in 1995.

  • Currently developed by the R Development Core Team, of which Chambers is a member.

(Source: Wikipedia)

The first lesson: Getting help

in R

In R, if you want to:

  • Know about the sqrt function? ?sqrt, or help("sqrt")

  • Know about the makeCluster function in the R package parallel? ?parallel::makeCluster, or help("makeCluster", package="parallel")

  • Search the help system for a topic, e.g. regular expressions? ??"Regular Expressions", or help.search("Regular Expressions")

  • See a full list of the functions and help files available in the package boot: help(package="boot").

  • Look at more in-depth information about the Matrix package? vignette(package="Matrix")


On the web


Books that I recommend

The first lesson: Getting help (How to read it?)

The first lesson: Getting help (a mental model)

My personal way of looking for R-based solutions to my problems (in science… of course)

Questions

  1. Using the stats package, how can you estimate a generalized linear model in R?

  2. What is the command to transpose a matrix in R? What about the command for inverting a matrix?

  3. Looking at CRAN task Views, what R packages are available for convex optimization? What about stochastic optimization?

  4. Create a list of R packages that provide wrappers for working with Slurm.

  5. What does the function for fitting nonlinear least squares in the stats package return?

Part 2: R language fundamentals.

Creating objects


Data types


Attributes and Structure


Missing values

R has different types of missing values:

str(c(NA, 1L))        # Integers can have NAs
##  int [1:2] NA 1
str(c(NaN, 1L, Inf))  # But not NaN or Inf (automatic coercion)
##  num [1:3] NaN 1 Inf
str(c(-Inf, 1, NULL, +Inf)) # And Nulls are... of length 0!
##  num [1:3] -Inf 1 Inf
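To tell these values apart in practice, base R provides is.na(), is.nan(), and is.finite(); a quick sketch:

```r
x <- c(NA, NaN, Inf, 1)

is.na(x)      # NaN counts as NA too:          TRUE  TRUE FALSE FALSE
is.nan(x)     # ...but NA is not NaN:         FALSE  TRUE FALSE FALSE
is.finite(x)  # NA, NaN, and Inf: non-finite: FALSE FALSE FALSE  TRUE
```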

Questions

  1. What is the mode of the following vector myvector <- c(NA, NaN, Inf)? (try not to use the mode() function in R)

  2. The c() function can be used with other vectors, for example

    my_integer_vector <- c(1L, 2L, 3L)    
    my_string_vector  <- c("hello", "world")

    What is the mode of the vector c(my_integer_vector, my_string_vector)?

  3. What does each of these functions return when applied to the vector myvector?

  4. What are the attributes of the following object mymat <- matrix(runif(12), ncol=4)?

Linear Algebra


Questions

Other fundamental types

Other fundamental types of objects in R are:


Statistical Functions


set.seed(12)
op <- par(mfrow = c(2,2))
hist(rnorm(1e5))
curve(qnorm)
curve(pnorm, xlim=c(-3, 3))
curve(dnorm, xlim=c(-3, 3))

par(op)

set.seed(12)
op <- par(mfrow = c(2,2))
hist(rexp(1e5))
curve(qexp)
curve(pexp, xlim=c(0, 6))
curve(dexp, xlim=c(0, 6))

par(op)

Questions

  1. Draw 1e5 samples from a \(\chi^2\) with 2 degrees of freedom (hint: check ?Distributions).

  2. Draw 1e5 samples from a \(\chi^2\) with 2 degrees of freedom using rnorm (hint: Recall that if \(X\sim N(0,1)\), then \(X^2\sim\chi^2_1\), and if \(X, Y\sim N(0,1)\) are independent, then \(X^2 + Y^2\sim\chi^2_2\)).

Control-flow statements


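As a minimal sketch (illustrative, not from the original slides), R's main control-flow statements are if/else, for, and while:

```r
x <- 5L

# if/else: branch on a condition
if (x %% 2 == 1) {
  msg <- "odd"
} else {
  msg <- "even"
}
msg  # "odd"

# for: iterate over a vector
total <- 0
for (i in 1:10)
  total <- total + i
total  # 55

# while: repeat while a condition holds
n <- x
while (n > 0)
  n <- n - 1L
n  # 0
```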

Functions



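A minimal sketch of how functions are defined and called in R (the function name add is hypothetical, for illustration only):

```r
# Functions are created with function(); the value of the last
# expression is returned automatically (or use return() explicitly)
add <- function(a, b = 1) {  # b has a default value
  a + b
}

add(2)             # 3
add(2, 10)         # 12
add(b = 5, a = 1)  # arguments can be matched by name: 6

# Functions are objects, so they can be passed to other functions
sapply(1:3, add, b = 100)  # 101 102 103
```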

Part 3: An extended example using HPCC

Agenda

  1. High-Performance Computing: An overview

  2. Parallel computing in R

  3. Extended examples

First: How to use R on HPC

There are two things that you need to do to use R in HPC:

  1. Source the corresponding R version: For example, if you want to work with version 3.4, you could just type

    source /usr/usc/R/3.4.0/setup.sh

    You can also include that line in your ~/.bash_profile file so that it is sourced automatically when you log in.

  2. Specify the R library path: In order to use R packages that were installed from your session while running through Slurm (for example), you have to specify the library path. There are a few ways of doing it:

    1. Use the .libPaths() command at the beginning of your R script
    2. Use the lib.loc option when calling library()
    3. Use the .Renviron file and set the R_LIBS value (see ?Renviron)

We have examples at the end of the presentation.
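For instance, assuming your packages live under ~/R/x86_64-pc-linux-gnu-library/3.4 (the path and the package name mypackage are placeholders), the three options look like this:

```r
# Option 1: at the beginning of your R script
.libPaths("~/R/x86_64-pc-linux-gnu-library/3.4")

# Option 2: per package, when loading it (mypackage is hypothetical)
library(mypackage, lib.loc = "~/R/x86_64-pc-linux-gnu-library/3.4")

# Option 3: in your ~/.Renviron file (not R code), add the line
# R_LIBS=~/R/x86_64-pc-linux-gnu-library/3.4
```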

High-Performance Computing: An overview

Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:

  1. Big data: How to work with data that doesn’t fit in your computer’s memory

  2. Parallel computing: How to take advantage of multiple core systems

  3. Compiled code: Write your own low-level code (if R doesn’t have it yet…)

(Check out the CRAN Task View on HPC)

Big Data

Parallel computing

GPU vs CPU

When is it a good idea?

Ask yourself these questions before jumping into HPC!

Ask yourself these questions before jumping into HPC!

Parallel computing in R

While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R packages for explicit parallelism:

Implicit parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, such as gpuR for matrix manipulation using the GPU.

Parallel workflow

  1. Create a cluster:

    1. PSOCK Cluster: makePSOCKcluster: Creates brand-new R sessions (so nothing is inherited from the master), even on other computers!

    2. Fork Cluster: makeForkCluster: Using OS Forking, copies the current R session locally (so everything is inherited from the master up to that point). Not available on Windows.

    3. Other: makeCluster with another cluster type (e.g. MPI, handled via the snow package)

  2. Copy/prepare each R session:

    1. Copy objects with clusterExport

    2. Pass expressions with clusterEvalQ

    3. Set a seed

  3. Do your call:

    1. mclapply, mcmapply if you are using Fork

    2. parApply, parLapply, etc. if you are using PSOCK

  4. Stop the cluster with stopCluster

parallel example 1: Parallel RNG

# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(2)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`

# 3. DO YOUR CALL
ans <- parSapply(cl, 1:2, function(x) runif(1e3))
(ans0 <- var(ans))
#               [,1]          [,2]
# [1,]  0.0861888293 -0.0001633431
# [2,] -0.0001633431  0.0853841838
# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:2, function(x) runif(1e3)))

all.equal(ans0, ans1) # All equal!
# [1] TRUE
# 4. STOP THE CLUSTER
stopCluster(cl)

parallel example 1: Parallel RNG (cont.)

In the case of makeForkCluster

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims <- 1e3
cl    <- makeForkCluster(2)    

# 2. PREPARING THE CLUSTER
RNGkind("L'Ecuyer-CMRG")
set.seed(123) 

# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:2, function(x) {
  runif(nsims) # Look! we use the nsims object!
               # This would have failed with makePSOCKcluster
               # if we didn't copy -nsims- first.
  }))
(ans0 <- var(ans))

# Same sequence with same seed
set.seed(123) 
ans1 <- var(do.call(cbind, mclapply(1:2, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros

# 4. STOP THE CLUSTER
stopCluster(cl)

parallel example 2: Simulating \(\pi\)

The R code to do this

pisim <- function(i, nsim) {  # Notice we don't use the -i-
  # Random points
  ans  <- matrix(runif(nsim*2), ncol=2)
  
  # Distance to the origin
  ans  <- sqrt(rowSums(ans^2))
  
  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}

parallel example 2: Simulating \(\pi\) (cont.)

# Setup
cl <- makePSOCKcluster(10)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
#       test replications elapsed relative
# 1 parallel            1   0.633    1.000
# 2   serial            1   1.425    2.251
ans_par <- parSapply(cl, 1:100, pisim, nsim=nsim)
ans_ser <- sapply(1:100, pisim, nsim=nsim)
stopCluster(cl)

# Comparing the estimates against R's built-in pi
c(par = mean(ans_par), ser = mean(ans_ser), R = pi)
#      par      ser        R 
# 3.141561 3.141677 3.141593

Slurm Example 1

# Include this to tell R where your packages live
.libPaths("~/R/x86_64-pc-linux-gnu-library/3.4/")

# Default CRAN mirror from where to download R packages
options(repos = c(CRAN = "https://cloud.r-project.org/"))

# You need to have the ABCoptim R package
library(ABCoptim)

fun <- function(x) {
  -cos(x[1])*cos(x[2])*exp(-((x[1] - pi)^2 + (x[2] - pi)^2))
}

ans <- abc_optim(rep(0,2), fun, lb=-10, ub=10, criter=50)

saveRDS(
   ans,
   file = paste0(
      "~/hpc-with-r/examples/01-slurm-abcoptim-",
      Sys.getenv("SLURM_JOB_ID"),                 # SLURM ENV VAR
      "-",
      Sys.getenv("SLURM_ARRAY_TASK_ID"),          # SLURM ENV VAR
      ".rds"
))
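A matching submission script might look like the following sketch; the script name 01-slurm-abcoptim.R, the job name, the array range, and the time limit are all placeholders to adjust for your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=abcoptim
#SBATCH --time=00:10:00
#SBATCH --array=1-4

# Load R (same setup script as before)
source /usr/usc/R/3.4.0/setup.sh

# Each array task runs the same R script; the R code above uses
# SLURM_JOB_ID and SLURM_ARRAY_TASK_ID to name its output file
Rscript --vanilla 01-slurm-abcoptim.R
```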

Now you try it!

RcppArmadillo + OpenMP + Slurm: Using the rslurm package

Now you try it!

Thanks!

# R version 3.5.1 (2018-07-02)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 18.04.1 LTS
# 
# Matrix products: default
# BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
# LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] parallel  stats     graphics  grDevices utils     datasets  methods  
# [8] base     
# 
# loaded via a namespace (and not attached):
#  [1] compiler_3.5.1  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
#  [5] tools_3.5.1     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.17   
#  [9] stringi_1.2.3   rmarkdown_1.10  highr_0.6       knitr_1.20     
# [13] stringr_1.3.1   digest_0.6.15   evaluate_0.10.1

See also

For more, check out the CRAN Task View on HPC

References

Jones, O., R. Maillardet, and A. Robinson. 2009. Introduction to Scientific Programming and Simulation Using R. Chapman & Hall/CRC The R Series. CRC Press. https://books.google.com/books?id=gnZC525wnzIC.

Marchand, Philippe. 2017. Rslurm: Submit R Calculations to a Slurm Cluster. https://CRAN.R-project.org/package=rslurm.

Matloff, N. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press Series. No Starch Press. https://books.google.com/books?id=o2aLBAAAQBAJ.

Peng, R. 2012. R Programming for Data Science. Lulu.com. https://books.google.com/books?id=GSePDAEACAAJ.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Vega Yon, George, and Enyelbert Muñoz. 2017. ABCoptim: An Implementation of the Artificial Bee Colony (ABC) Algorithm. https://github.com/gvegayon/ABCoptim.

Wickham, H. 2015. Advanced R. Chapman & Hall/CRC The R Series. CRC Press. https://books.google.com/books?id=FfsYCwAAQBAJ.

Wickham, H., and G. Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. https://books.google.com/books?id=vfi3DQAAQBAJ.